Data Acquisition
Initial Data Scrape with rvest
Men’s world records scrape.
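library(rvest)    # read_html(), html_nodes(), html_table()
library(dplyr)    # %>%, select(), mutate()
library(stringr)  # str_remove()

url <- "https://www.worldathletics.org/records/by-category/world-records"
mens_world_records <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="menoutdoor"]/table') %>%
  html_table()
mens_world_records <- mens_world_records[[1]]  # html_table() returns a list; keep the first table
Women’s world records scrape.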
womens_world_records <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="womenoutdoor"]/table') %>%
  html_table()
womens_world_records <- womens_world_records[[1]]
Per-event record progressions, for a larger data pull later, are listed at https://en.wikipedia.org/wiki/Athletics_record_progressions.
Data Cleaning
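mens_world_records <- mens_world_records %>%
  select(Discipline, Perf, Competitor, DOB, Country, Date) %>%
  mutate(DOB = lubridate::dmy(DOB),
         Date = lubridate::dmy(Date)) %>%
  # Strip annotation markers from Perf. fixed() matches the pattern literally,
  # because "*", "(" and ")" are regex metacharacters.
  mutate(Perf = gsub("h #", "", Perf),
         Perf = str_remove(Perf, "h"),
         Perf = str_remove(Perf, fixed(" *")),
         Perf = str_remove(Perf, fixed("*")),
         Perf = str_remove(Perf, fixed(" (i)")),
         Perf = str_trim(Perf),  # drop any whitespace left behind
         Discipline = factor(Discipline),
         Country = factor(Country))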
womens_world_records <- womens_world_records %>%
  select(Discipline, Perf, Competitor, DOB, Country, Date) %>%
  mutate(DOB = lubridate::dmy(DOB),
         Date = lubridate::dmy(Date)) %>%
  # Strip annotation markers from Perf. fixed() matches the pattern literally;
  # as a regex, " (i)" is a space plus a capture group and never matches "(i)".
  mutate(Perf = gsub("h #", "", Perf),
         Perf = gsub(" #", "", Perf),
         Perf = str_remove(Perf, "h"),
         Perf = str_remove(Perf, fixed("Wo *")),
         Perf = str_remove(Perf, "Wo"),
         Perf = str_remove(Perf, " Mx"),
         Perf = str_remove(Perf, fixed("*")),
         Perf = str_remove(Perf, fixed(" (i)")),
         Perf = str_trim(Perf),  # drop any whitespace left behind
         Discipline = factor(Discipline),
         Country = factor(Country))
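One likely reason a pattern such as " (i)" leaves the indoor marker in place: both stringr and base R treat a pattern as a regular expression by default, so the parentheses form a capture group and the pattern looks for a space followed by "i", not the literal text " (i)". The same goes for " *", which as a regex means "zero or more spaces". The sketch below uses base R's sub() so it runs without extra packages; the two input strings are assumed examples of raw values, not taken from the scraped table. With stringr, str_remove(x, fixed(" (i)")) should behave like the fixed = TRUE call.

```r
# Assumed raw values: a mark followed by an annotation marker
indoor  <- "5:23.75 (i)"
starred <- "9.58 *"

# As a regex, " (i)" means: a space, then the group (i), i.e. the text " i"
indoor_regex   <- sub(" (i)", "", indoor)
# fixed = TRUE treats the pattern as literal text
indoor_literal <- sub(" (i)", "", indoor, fixed = TRUE)
# Escaping the parentheses is the regex equivalent of the literal match
indoor_escaped <- sub(" \\(i\\)", "", indoor)

# As a regex, " *" matches zero or more spaces, so it matches the empty
# string at the start and removes nothing
starred_regex   <- sub(" *", "", starred)
starred_literal <- sub(" *", "", starred, fixed = TRUE)

print(c(indoor_regex, indoor_literal, indoor_escaped))
print(c(starred_regex, starred_literal))
```

Matching literally also removes the need to patch individual rows by position, which would break if the table order changes.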
# xlsx::write.xlsx(womens_world_records, "womens_world_records.xlsx")
Basic Plotting
Running World Records
Separated into Track, Middle Distance, and Long Distance
Plotting Men’s and Women’s World Records Together.
Split into smaller sections once again
Percent Difference at Each Level
By this metric, the most impressive world records appear to be the men’s 20,000 through 30,000 meters. This could also simply reflect that these races are largely unimportant in the racing world, as suggested by the jump in the ratio at the half marathon.
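The formula behind the percent difference is not shown here. A minimal sketch, assuming it is the women's mark relative to the men's, (women - men) / men * 100, with each mark first converted from a "h:mm:ss" style string to seconds. to_seconds() and pct_diff() are hypothetical helper names, and the only marks used are the 100 meter world records (9.58 and 10.49 seconds).

```r
# Hypothetical helper: convert "ss.ss", "m:ss.ss" or "h:mm:ss" strings to seconds
to_seconds <- function(perf) {
  vapply(strsplit(perf, ":", fixed = TRUE), function(parts) {
    parts <- as.numeric(parts)
    # The last field is seconds, the one before it minutes, then hours
    sum(parts * 60^rev(seq_along(parts) - 1))
  }, numeric(1))
}

# Hypothetical helper: how much slower the women's mark is, in percent
# (assumed definition, not confirmed by the original analysis)
pct_diff <- function(men, women) {
  (to_seconds(women) - to_seconds(men)) / to_seconds(men) * 100
}

# Parsing check on strings in each supported shape
print(to_seconds(c("9.58", "5:23.75", "2:01:39")))

# 100 meter world records: men 9.58, women 10.49
print(round(pct_diff("9.58", "10.49"), 1))
```

This only applies to timed events; field events are measured in meters, where a larger value is better, so the sign of the difference would flip.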
Progression of the 100 meter world record
The vast majority of 100 meter world records have been set with a positive wind reading (a tailwind). Is this because races are usually run in non-windy conditions, or because wind speed has a drastic effect on runners? Given how small the negative wind readings are in the few cases where the record was broken into a headwind, wind most likely has a significant impact.
Percent Improvement for World Record Each Time
Usain Bolt is absolutely bonkers good; no one else has ever made such an individual dent in the 100 meter world record.
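As a rough illustration of the metric, assuming the improvement is measured relative to the previous record, (old - new) / old * 100, the sketch below uses the four 100 meter records from 2007 to 2009: 9.74 by Asafa Powell, then 9.72, 9.69, and 9.58 by Usain Bolt. The variable names are hypothetical, and the full progression used for the plot is not reproduced.

```r
# 100 meter world records in chronological order, 2007 to 2009 (seconds)
records <- c(9.74, 9.72, 9.69, 9.58)

# Percent improvement of each record over the one it replaced
# (assumed definition: drop in time divided by the previous record)
pct_improvement <- -diff(records) / head(records, -1) * 100

print(round(pct_improvement, 2))
```

The last step, from 9.69 to 9.58, is several times larger than the two before it, which is the kind of dent the plot highlights.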
The next step is to add an animation of each world record over time, but that requires getting all of that data first.